Efficient dendritic learning as an alternative to synaptic plasticity hypothesis

Synaptic plasticity is a long-lasting core hypothesis of brain learning that suggests local adaptation between two connecting neurons and forms the foundation of machine learning. The main complexity of synaptic plasticity is that synapses and dendrites connect neurons in series and existing experiments cannot pinpoint the significant imprinted adaptation location. We showed efficient backpropagation and Hebbian learning on dendritic trees, inspired by experimental-based evidence, for sub-dendritic adaptation and its nonlinear amplification. It has proven to achieve success rates approaching unity for handwritten digits recognition, indicating realization of deep learning even by a single dendrite or neuron. Additionally, dendritic amplification practically generates an exponential number of input crosses, higher-order interactions, with the number of inputs, which enhance success rates. However, direct implementation of a large number of the cross weights and their exhaustive manipulation independently is beyond existing and anticipated computational power. Hence, a new type of nonlinear adaptive dendritic hardware for imitating dendritic learning and estimating the computational capability of the brain must be built.

www.nature.com/scientificreports/ and post-synaptic neurons, cannot pinpoint the significant imprinted adaptation location, whether it is located at the synapse or at the dendrite. Although there is a consensus on temporal adaptation in synaptic boutons and spines, following their recent stimulation patterns, their time independent imprinted adaptation without further stimulations is in question. In addition, assuming imprinted SP, one cannot exclude from current experiments fast and significant enhanced adaptation in the dendrites connected in series to the synapses.
Dendritic learning. Results of recent experiments indicate that fast and enhanced adaptation occurs when two dendrites are mutually trained, similar to the slow adaptation currently attributed to the synapses 17 . This phenomenon differs from dendritic computation 18 based on static dendritic features. Its timescale depends on the training frequency and can be reduced to several seconds only 19 . Although the results pose a question on SP, current experiment results cannot exclude slow and noisy SP in parallel to fast dendritic adaptation. Experiments also indicate that certain dendrites demonstrate forward and backward action potentials and nonlinear dendritic excitability, which resembles spike waveforms [20][21][22] . We begin with the simulation results, where experimental results supporting adaptation within a dendrite are briefly presented.

Realization of DL by a single neuron.
We recently experimentally examined dendritic adaptation by mutually training two dendrites 17 . However, the adaptation sites along the dendrites were obscure. Here, we assumed an adaptive strength for each dendritic segment, where each segment additionally functioned as a nonlinear amplifier [23][24][25][26][27][28][29][30][31] (Fig. 2a, right). Implementing BP on a tree architecture was simpler as each weight was influenced by an output unit via one route only (Fig. 2a). A weight change was accumulated backward from the output unit along the route, where temporarily only a nodal state and its successive weight were required, no long-term memory was needed. The quantitative results of such a tree BP (TBP) on a feedforward tree network (FFTN) were presented for the , an axon (blue), and multiple synapses (green circles). A subthreshold synaptic stimulation (black) via the left dendrite arrives at the soma after a spike (red) was generated through an above-threshold synaptic stimulation via the right dendrite, thereby strengthening the left synapse (enlarged red circle) via a backpropagation signal (red dashed line). (c) Biological scheme of the right neuron connecting to the left neuron via axon/synapse/dendrite (blue/green-circle/gray). Artificial scheme as in a (bottom).  (Fig. 2b). Each FFTN was trained independently to identify one digit. For the selected digit, the output is trained toward 1; otherwise, toward 0. The predicted test digit was selected as the FFTN with the maximal output value. The training parameter optimization (described in "Methods") resulted in a ~ 0.047 test error. Generalization of each FFTN to several FFTNs trained independently using different initial conditions and with a soft committee output (described in "Methods") resulted in a ~ 0.034 test error. We note, that training only the weights from the input to the hidden units results in an optimized test error greater than 0.47 for fixed, uniform or random, weights to the output unit. This significant increase in the test errors indicates the importance of the training of the entire FFTN (Fig. 2b), besides weights from the inputs functioning similar to SP. This result is much below the success rates of a linear classifier 33 and is attributed to the nonoverlapping receptive fields of the nonlinear hidden units, where each one is influenced by a small subset of the inputs, and for the nonlinear activation functions of the hidden units with fixed output weights.
Training a fully connected architecture network, with 100 hidden units and 10 output units (Fig. 2c) results in a test error of 0.018 only 34 (see "Methods"). Each output unit in this architecture is connected to an input unit via multiple routes, thus violating the tree structure in terms of nodes. Nevertheless, each output unit was connected to a weight via one route only (Fig. 2c), and thus the principle of TBP holds. In general, TBP holds for fully connected input/output layers to their nearby layers and tree structure elsewhere (Fig. 2d).
The biological realization of TBP on these architectures (Fig. 2c) poses the following two conditions: First, each weight that connects the input and hidden nodes must be updated 10 times, according to the current output values of the 10 output nodes. These updates can be realized asynchronously using, for example, different delays for each output unit. Second, error function ǫ is a summation of individual errors of each output unit as exemplified for the quadratic error function, where O i /O desired i denotes the output and desired output, respectively. This property holds also for the cross-entropy cost function used in this study (see "Methods"). Note www.nature.com/scientificreports/ that an output node in Fig. 2c, for instance, can biologically imitate a neuron with a single dendritic tree, using an additional output node connected in series to the current one via a single weight. This additional weight represents the dendritic route. In principle, the entire network ( Fig. 2c) might also be represented by a neuron with a ramified dendrite, however, it requires a more complex structure to the output layer and will be discussed elsewhere.

Realization of input crosses by a single neuron.
Input crosses that represent higher-order correlations among input nodes can enhance success rates [34][35][36][37] . However, their biological realization is questionable as no morphological evidence to multiple connections of several axons to one synapse exists, which might accomplish input crosses 38 . In addition, multiple connections of one axon to a neuron is infrequent 38 and cannot accomplish input crosses. Nevertheless, the byproduct of multiple inputs to a nonlinear dendritic segment amplifier accomplishes input crosses (Fig. 3a). Their order and number are further enhanced by nonlinear dendritic segment amplifiers closer to the soma, which typically incorporate a vanishing number of newer inputs (Fig. 3b). This amplification represents a non-local adaptation mechanism, as dendritic segment amplification is equivalent to simultaneous amplification of influx through all its incoming synapses (Fig. 3c). The addition of 10,000 input crosses among three inputs for each hidden unit, in a micro-canonical manner 34 (described in "Methods"), enhances success rates by ~ 3% for small training datasets comprising 15, 30 and 60 examples per digit (Fig. 3d). Maximization of success rates using multiple input crosses at the maximal training dataset (i.e., 50,000 examples) is a heavy computational task. Nevertheless, a power-law extrapolation of the obtained success rates from small training datasets to the maximal dataset results in a ~ 0.01 error rate (Fig. 3d), which outperforms the extrapolation of the error rate without input crosses, ~ 0.018, 34 which is now directly confirmed (see "Methods"). www.nature.com/scientificreports/ Using input crosses among three inputs achieved slightly better success rates and faster convergence times than using input crosses between two inputs 34 . This observation might indicate the importance of higher-order input crosses, where their improvement might stem from the phenomenon of strong first-order phase transition for systems with higher-order multi-spin interactions 39 .
Hebbian learning using a single neuron. The TBP procedure simplifies the biological realization of DL.
However, the necessity to precisely calculate derivatives of activation functions and their products with nearby weights is beyond known biological hardware capabilities. Thus, using tiny imprecise updates that result in accumulated small additive and multiplicative noise to the TBP procedure is expected to only slightly decrease the success rates. However, their enhancement entails unavoidable significant deterioration in obtained success rates. Here, we have presented another possible solution based on the perceptron local learning algorithm 40 .
The architecture consisted of 10 FFTNs, each of which consisted of seven perceptrons, with an additional 10,000 input crosses (see "Methods"). The output of each FFTN was a committee of the seven perceptrons, connected to the output with unit weights (Fig. 3e). Each FFTN was trained independently using the least action algorithm 41,42 to recognize one digit. For the selected digit, the output is trained to be 1; otherwise, -1. The perceptron learning step was performed only when the number of perceptrons with the correct output was less than 5. The step was realized on the perceptron with a wrong output and a minimal absolute local field. Thus, a test error of ~ 0.029 was obtained (described in "Methods"). Note that the least action algorithm requires the knowledge of output local fields of all perceptrons, a non-local decision, but their number is small.

Experimental results supporting intra dendritic adaptation. Recently, new types of experiments
have been performed 43 , wherein the synaptic connectivity of neuronal cultures is excluded (see "Methods") and a patched neuron is extracellularly stimulated from several sites using a multi-electrode array (Fig. 4a). The experimental results indicate that a neuron functions as a collection of independent threshold units, with a specific spike waveform for each one 43 . Specifically, the neuron is anisotropically activated following the origin of the arriving signals to the soma, via its dendritic trees [43][44][45] , and the neuronal spike waveform varies as a function of the stimulation location (Fig. 4b).
These anisotropic properties can demonstrate fast dendritic adaptation 17 , similar to the slow adaptation mechanism currently attributed to synapses 3,46,47 . We used an online method to identify a pair of differing extraand intra-cellular recorded spike waveforms that represent neuronal activation from two dendritic trees. The training procedure involves pairs of an extracellular stimulation that did not evoke a spike and arrived with a predefined delay, typically a few milliseconds, after (or before) an above-threshold intracellular stimulation. For training at a low frequency (e.g., 1 Hz), a significant effect of adaptation was observed after several minutes and was found to be irreversible for a timescale of tens of minutes 17 . Further, an increase in the training frequency (5 Hz) accelerated neuronal adaptation processes to several seconds only 19 .
The resolution of our experimental setup does not allow to pinpoint the sub-dendritic adaptation sites. Nevertheless, in this work, we presented a support for a dendritic adaptation while two of its branches were trained. We used an online method 43 to identify a pair of two extracellular electrodes with similar intracellularly recorded spike waveforms, but varying neuronal response latency and different critical firing frequencies 45 (Fig. 4c), hence represented neuronal activation via different branches of the same dendrite (Fig. 4a). Finally, the neuron was trained using pairs of extracellular stimulations (Fig. 4d), where the stimulation amplitude threshold of one of the electrodes had changed after training (Fig. 4e).
Results indicate an adaption process while stimulating different extracellular electrodes that represent training different routes of the same dendrite (Fig. 4e) and suggest that the adaptation occurs in a sub-route of the trained dendritic tree. This adaptation process is also supported by preliminary results (not shown) where the threshold of the route which is associated with the first stimulation in a pair (purple electrode in Fig. 4d), remains unchanged after the training process. This phenomenon is the inspiration for the TBP scheme presented in Fig. 2. We note that further investigation of the number of sub-dendritic adaptation sites, amplitudes and timescales demands longer measurements with higher resolution experimental techniques.

Discussion
SP is the core hypothesis of brain learning, and its reality is challenged by the following two aspects: SP as a standalone learning mechanism and in comparison to dendritic learning.
As a standalone mechanism, imprinted SP is a slow and noisy adaptation process, which typically lasts tens of minutes and occurs far from the computational element, namely, the spiking soma. The realization of efficient learning in ANNs using the biological recipe of SP is obscure. In addition, time lags among influx stimulations of the soma via different synapses are a critical parameter that controls the adaptation process. However, these time lags are a function of neuronal response latencies that fluctuate and vary dynamically, following previous activities of connecting neuronal chains 48 . Moreover, synaptic strengths are typically considerably below threshold 49,50 and many coordinated input timings are required to repeatedly reproduce the same desired neuronal outputs.
A pair of neurons are connected using several elements in a series and in particular synapses and dendrites. The long-established hypothesis states the adaptation occurs in synapses, which is generally supported experimentally by training of pre-and post-synaptic neurons. However, this evidence cannot pinpoint the significant imprinted adaptation sites, without tracing the signal along its inter-neuronal route. Our experiments demonstrated significant dendritic adaptation that emerged at least one order of magnitude faster than the common scenarios for imprinted SP. Currently, one cannot exclude slow, moderate, and noisy SP in parallel to the measured dendritic adaptation. Moreover, the number of dendritic branches is in the order of tens 51 www.nature.com/scientificreports/ adaptation of all its incoming synapses (Fig. 3c). This non-local adaptive process is expected to enhance the signal-to-noise ratio in comparison to SP. Learning on dendritic trees, where each weight is connected to an output unit via one route only, represents a step toward a plausible biological realization. Tree architectures, although comprising much lesser number of weights, have been demonstrated for the MNIST database to achieve success rates closer to unity, which were previously obtained using more structured DL architectures. This represents the effectiveness of DL when the number of adaptive parameters is in the order of the number of nodes. The realization of dendritic learning using 10 independent FFTN identifiers, one for each digit, and especially using the Hebbian learning rule, might also lead to a better understanding of the biological credit assignment mechanism 13,53,54 .
The emergence of many input crosses as a byproduct of nonlinear amplification of dendritic segments differentiates between the computational power of a single dendrite or neuron from existing CPUs and GPUs. Each dendrite has thousands of presynaptic inputs that generate an exponential number of input crosses as segment www.nature.com/scientificreports/ signals are propagating toward the soma. For a thousand dendritic inputs, for example, there are O(10 6 ) input crosses of order 2 and O(10 21 ) input crosses of order 7. Assigning independent weights to such a large number of input crosses and manipulating their strengths via the BP procedure is beyond existing and anticipated computational power. Evidently, this large number of cross weights are not independent. Knowledge of dendritic inputs, dendritic local weights, and nonlinear amplifiers determines all current cross weights and nodal responses. This valid mathematical statement calls for modelling dendritic nonlinear amplification, rather than estimating weights independently through BP. In addition, it calls for the learning of the type of the nonlinear nodal activation functions as an additional tunable parameters, which can control the ratios between induced cross weights and their order. We note that the byproduct of a deep architecture scheme, consisting of more hidden layers and nonlinear activation functions, is the emergence of cross weights, however, the manipulation of each one of them independently is difficult. Thus, the current cost function for training neural networks can be improved by training independently many cross weights, where in addition features of the adaptive nonlinear amplifiers are changed following the learning process. Furthermore, nonlinear sub-segment dendritic amplifications result in non-intuitive phenomena such that amplification is sensitive to the order of dendritic segment inputs. In addition, adaptation of one sub-segment due to a nearby input, might have a decaying effect on the strength of subsequent sub-sequences, which in turn might create nontrivial dependencies between multiple tunable parameters beyond current simulation models. Existing computer hardware that differs from exemplified brain dynamics is distant from the possibility of imitating their learning process and estimating their computational capabilities. Finally, the experimental support of dendritic adaptation must be refined, using a tradeoff between higher spatial resolution of dendritic segment measurements and long periods of multiple stimulation scheduling. This type of experiment is expected to verify the possible coexistence of SP alongside dendritic adaptation. Qualitative modeling of the main features of dendritic adaptation and their differentiation among various neurons and dendrites are required for understanding of neural network dynamics and their computational capabilities.

Methods
Architecture and initial weights, Fig. 2c. The feedforward neural network consisted of 784 input units, 2 hidden layers consisting of 100 units each and 10 output units. Weights between successive layers were fully connected. Each unit in the hidden and the output layers had an additional input from a bias unit 34 . We denote by W 1 and W 2 the weights from the input layer to the hidden layer and from the hidden layer to the output layer, respectively. The initial conditions of all weights were randomly chosen from a Gaussian distribution with a zero average and standard deviation (Std) equals 1. All weights were normalized at the initial condition only 19 such that all input weights to each hidden unit had a zero average and Std equals 1. In addition, the initial value of the bias of each weight was set to 1. Architecture and initial weights, Fig. 3d. The architecture of this network was similar to the architecture in Fig. 2c, with only one hidden layer and additional 10,000 input-crosses for each hidden unit.

Input. Each
An addition of 10,000 input-crosses was added to the input, X k,l,j : where k, j and l are random indices in the range [1, 784] with corresponding different pixels X k , X j and X l for a given example. Zero input-crosses in all the train dataset were excluded. Each input-cross was not connected more than once to each hidden unit. After the above-mentioned initial normalization of all weights, the weights of the input-crosses were rescaled: Forward propagation. The output of a unit, j, in the first hidden layer for the mth example, for instance, a 1 j,m , was calculated as: www.nature.com/scientificreports/ where W 1 ij is the weight from the ith input unit to the jth hidden unit, X i is the ith input, and b 1 j is the bias induced on the jth unit in the first hidden layer. z 1 j,m represents the field propagating from the input layer. Each time we calculated the field, z 1 j,m , we subtracted the accumulative average field for the input layer of the previous m − 1 examples, where Amp 1 is a constant representing the amplitude of reduction. Note that z 1 j,m was not modified for m = 1.
The output of the jth unit in the output layer, a 2 j , was calculated as following: where W 2 ij is the weight from the ith unit in the hidden layer to the jth output unit, and b 2 j is the bias induced on the jth output unit.
Back propagation. We used the cross entropy cost function: where y m stands for the desired labels, a m stands for the current 10 output units of the output layer, and η and α are constants. The summation was over all M training examples. The second summation was over all weights of the network.
The backpropagation using the momentum method computes the gradient for each weight with respect to the cost function. The weights and biases were updated according to: where t is a discrete time-step, W are the weights, 1 − α is a regularization constant, µ is the momentum constant and η is the learning rate constant. ∇ W C first is the first computed gradient. V was initialized as: V 0 = −η · ∇ W C first .  Note that in Fig. 3d for 50,000 examples and without input crosses we obtained a test error ε = 0.018, which is consistent with a power law 34  Architecture and initial weights for FFTN, Fig. 2b. The feedforward neural network comprised of 10 identifiers, each consisted of 784 inputs that were divided into groups of 16 consecutive pixels along the rows, one hidden layer consisted of 49 units and one output unit. Each unit in the hidden and the output layers had an additional input from a bias unit. We denote by W 1 and W 2 the weights from the input layer to the hidden layer, and from the hidden layer to the output layer, respectively. The initial conditions of all weights were randomly chosen from a Gaussian distribution with a zero average and Std equals 1. All weights to each hidden unit were normalized 19 such that they had a zero average and Std equals 1.

Figure 3d optimized parameters.
Input. The first hidden layer was not fully connected, therefore the input X for each hidden unit was calculated as: where X j represent the different input groups of jth hidden unit. k = 1,2,.0.16 and p is a number running from 1 to size of input 16 . The average test error for 50,000 examples and 50 epochs was ε = 0.047 and the Std was 0.023. The test error with committee of 6 trained networks was ε = 0.0339 and the Std was 0.025. The parameters were: η = 0.023, µ = 0.998, α = 0.0000002, Amp 1 = 0.1 It was noted that for the case where inputs were projected randomly to each hidden unit, very similar test errors (with less than 1% decrease) were obtained but with increased Std between samples. Similar results were also obtained for a similar architecture with 56 hidden units, where each one is connected to 14 inputs instead of 16. Architecture and initial weights, Fig. 3e. The network comprised of 10 identifiers that each contained 784 input units with additional 10,000 input-crosses for each hidden unit (see Input), and one hidden layer consisting of 7 units each. The input and the hidden layers were fully connected, except the input-crosses. We denote by W 1 and W 2 the weights from the input layer to the hidden layer, and from the hidden layer the output layer, respectively. The initial conditions of W 1 were randomly chosen from a Gaussian distribution with a zero average and Std equals 1. All weights to each hidden unit were normalized 19 such that they had a zero average and Std equals 1.
Note that W 2 weights were set to 1, see Fig. 3e.
Forward and back propagation. The output of a unit, j, in the first hidden layer for the mth example, for instance, a 1 j,m , was calculated as: where W 1 ij is the weight from the ith input unit to the jth hidden unit, X i is the ith input and b 1 j is the bias induced on the jth unit in the first hidden layer. z 1 j,m represents the field propagating from the input layer. The weights were updated according to the following: www.nature.com/scientificreports/ where t is a discrete time-step, W are the weights, 1 − α is a regularization constant, η is a constant learning rate, y stands for the desired labels, and a stands for the current output unit. Note that the update was realized on the perceptron with a wrong output and a minimal absolute local field, and only when the number of perceptrons with the correct output was less than 5.
The average test error for 50,000 examples and 50 epochs was ε = 0.029 and the Std was 0.019. The parameters were: η = 0.05, α = 0.0002.
Test accuracy. The network test accuracy was calculated based on the MNIST dataset for testing, containing Optimization. For a given architecture and number of epochs, the optimization procedure first evaluated the test error over a rough grid of the adjustable parameters, followed by fine-tuning grids with higher resolutions. In cases where a complete optimization over a grid was impossible, we optimized sequentially each parameter over its 1D grid. Nevertheless, we confirmed that a few different sequential orders of the optimized parameters resulted in the same optimized test accuracy and set of parameters.
The optimization was performed independently for each examined dataset size, number of examples and number of epochs. The hyperparameters were optimized using several validation sets. Results for the committee systems were based on the optimized selected parameters for a single system. The optimized parameters were summarized in the presented tables.
We note that cross validation was confirmed using several validation databases consisting each of 10,000 random examples with the same statistics for each label as in the test set. Averaged results had the same Std as reported test errors. Similar results were obtained using a test set with different initial conditions. In addition, preliminary results also indicate that databases consisting of random selected examples, also result in similar test errors.
Committee. The test error was further minimized using a soft committee decision based on several replicas, Nc, of the network, which were trained on the same set of examples but with different initial weights. The result label, j, for the test accuracy is given by: where a L j,s stands for the value of the output label j in output layer L and in replica s (j = 0, 1, …0.9).
Experimental methods. The In-Vitro experimental methods are similar to those of our previous studies 43,45 , and only the modifications are presented.

Data availability
Source data are provided with this paper. All other data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request. A prototype simulation code in matlab for the FFTN is provided with this paper in GitHub. W t+1 = (1 − α)W t + η · y − a · X max j Nc s=1 a L j,s